ARISE-PIE: A People Information Integration Engine over the Web

Authors

  • Vincent W. Zheng
  • Tao Hoang
  • Penghe Chen
  • Yuan Fang
  • Xiaoyan Yang
  • Kevin Chen-Chuan Chang
Abstract

Searching for information about people on the Web is a common practice. However, searching for such information manually is time consuming. In this paper, we aim to develop an automatic people information search system, named ARISE-PIE. To build such a system, we tackle two major technical challenges: data harvesting and data integration. For data harvesting, we study how to leverage a search engine to help crawl the relevant Web pages for a target entity; we then propose a novel learning-to-query model that can automatically select a set of “best” queries to maximize collective utility (e.g., precision or recall). For data integration, we study how to leverage flexible forms of constraints as weak supervision to achieve collective information extraction from a target entity’s Web page corpus; we then propose a novel conditional probabilistic formulation to model constraints and an efficient realization to enable inference with constraints. We evaluate our data harvesting and data integration solutions on real-world data sets and show that both achieve better performance than the state-of-the-art baselines. We also evaluate our system on a benchmark data set and with a user study, both of which show promising results.
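
As a rough illustration of what selecting a set of queries for collective utility can look like, the sketch below greedily picks queries that add the most not-yet-covered relevant pages, with recall as the utility. This is not the paper's learning-to-query model; the greedy strategy, the function names (`select_queries`, `recall`), the query strings, and the page identifiers are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def select_queries(candidates: Dict[str, Set[str]],
                   relevant: Set[str],
                   budget: int) -> List[str]:
    """Greedily pick up to `budget` queries, each time taking the query
    that adds the most relevant pages not yet covered (illustrative only)."""
    chosen: List[str] = []
    covered: Set[str] = set()
    for _ in range(budget):
        best, best_gain = None, 0
        for query, pages in candidates.items():
            if query in chosen:
                continue
            gain = len((pages & relevant) - covered)
            if gain > best_gain:
                best, best_gain = query, gain
        if best is None:  # no remaining query adds a new relevant page
            break
        chosen.append(best)
        covered |= candidates[best] & relevant
    return chosen


def recall(chosen: List[str],
           candidates: Dict[str, Set[str]],
           relevant: Set[str]) -> float:
    """Fraction of relevant pages returned by at least one chosen query."""
    if not relevant:
        return 0.0
    found: Set[str] = set()
    for query in chosen:
        found |= candidates[query] & relevant
    return len(found) / len(relevant)


# Hypothetical data: pages each candidate query would return for one person.
candidates = {
    "name": {"p1", "p2", "p3", "p9"},
    "name + affiliation": {"p1", "p2"},
    "name + email": {"p4"},
    "name + homepage": {"p3", "p4", "p5"},
}
relevant = {"p1", "p2", "p3", "p4", "p5"}

chosen = select_queries(candidates, relevant, budget=2)
print(chosen, recall(chosen, candidates, relevant))
```

The point of the sketch is that queries are scored as a set rather than one by one: a query that looks good alone can add little once an overlapping query has already been chosen.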


Similar resources

Towards Supporting Exploratory Search over the Arabic Web Content: The Case of ArabXplore

Due to the huge amount of data published on the Web, the Web search process has become more difficult, and it is sometimes hard to get the expected results, especially when the users are less certain about their information needs. Several efforts have been proposed to support exploratory search on the web by using query expansion, faceted search, or supplementary information extracted from exte...


PIE: an online prediction system for protein–protein interactions from text

Protein-protein interaction (PPI) extraction has been an important research topic in the bio-text mining area, since PPI information is critical for understanding biological processes. However, there are very few open systems available on the Web, and most of these systems focus on keyword searching based on predefined PPIs. PIE (Protein Interaction information Extraction system) is a configurable...


Domain resource integration system

This paper introduces the Domain Resource Integrated System (DRIS), a hierarchical distributed Internet information retrieval system. The system is intended to solve bottleneck problems in current web search systems, such as long update intervals and poor coverage. DRIS will build the information retrieval infrastructure of the Internet, but it is not a commercial search engine. The protocol series of DR...


Adaptive Information Analysis in Higher Education Institutes

Information integration plays an important role in academic environments, since it provides a comprehensive view of education data and enables managers to analyze and evaluate the effectiveness of education processes. However, the problem with traditional information integration is the lack of personalization, due to weak information resources or the unavailability of analysis functionality. In this ...


Search and Inference with Diagrams

We developed a process for presenting diagrammatic information on the World Wide Web such that the information is searchable and accessible to inference engines. We implemented this process for chart diagrams. Generally, information that is presented diagrammatically is stored in an image file. Typical search engines have access only to image tags and surrounding text. Hence, the information pr...



Publication date: 2016